Testing Multiple Hypotheses

Master 1 CSILS/CARE - Numerical methods

Bertrand Servin

Hypothesis Testing

  • Formulate a null hypothesis \(H_0\) (e.g. \(\gamma = 0\))
  • Compute a statistic that follows a known distribution under the null hypothesis, e.g. \[t=\frac{\hat{\gamma}}{SE(\hat{\gamma})} \sim \mathcal{T}(n-2)\] for the slope of a linear regression
  • Compute the probability that a value \(\geq |t|\) would have been obtained under the null hypothesis = the p-value
  • Reject the null \(H_0\) if the p-value is less than some threshold
  • What threshold should I use? And what threshold should I use if I test many hypotheses?
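To make the steps above concrete, here is a minimal sketch (with simulated data; the variable names are ours) computing the slope t statistic and its p-value by hand, which matches the p-value reported by summary(lm(...)):

```r
set.seed(1)
n = 30
x = rnorm(n)
y = 1 + 0.5 * x + rnorm(n)           # simulated data, true slope gamma = 0.5
fit = lm(y ~ x)
gamma_hat = coef(fit)["x"]           # estimated slope
se = summary(fit)$coefficients["x", "Std. Error"]
t = gamma_hat / se                   # t statistic
# two-sided p-value under H0: gamma = 0, with t ~ T(n-2)
p = 2 * pt(abs(t), df = n - 2, lower.tail = FALSE)
```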

Type I and Type II errors

Hypothesis testing is about making a decision about the null hypothesis

Truth \ Decision   \(H_0\)                       \(H_1\)
\(H_0\)            Correct                       Type I Error (\(\alpha\))
\(H_1\)            Type II Error (\(\beta\))     Correct
  • \(\alpha\) is the probability of a Type I error (False Positive Rate). This is the criterion typically used to declare statistical significance, usually set at 0.05.
  • \(\beta\) is the probability of making a Type II error (False Negative Rate); the power of a test is \(1-\beta\).
  • Type I and Type II errors are strongly linked.
  • If we are ready to take more risk (increase the Type I error), we can call more results significant.
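This trade-off can be computed analytically with the non-central t-distribution (a sketch using the same parameters as the experiment below: true effect \(\gamma = 2\), standard error 1, 20 degrees of freedom):

```r
df = 20
ncp = 2                                # true effect gamma = 2, SE = 1
alpha = c(0.01, 0.05, 0.2)             # increasing type I error levels
# two-sided critical value for each type I error level
t_crit = qt(1 - alpha / 2, df = df)
# power: probability that |t| exceeds the critical value under H1
power = pt(t_crit, df = df, ncp = ncp, lower.tail = FALSE) +
        pt(-t_crit, df = df, ncp = ncp)
# power increases with alpha: tolerating more false positives
# lets us detect more true effects
```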

Type I and Type II errors: Experiment

Assume the true value of my parameter \(\gamma\) is 2, its standard error is 1, and the model has 20 degrees of freedom. Since the standard error is 1, the estimator \(\hat{\gamma}\) approximately follows a t-distribution with non-centrality parameter (ncp) \(\gamma = 2\)

  1. Create a function that takes as input a \(t\) value and a number of degrees of freedom and outputs the p-value of the corresponding \(t\)-test for the null \(\gamma = 0\). Use the function pt from R.
  2. Using the R function rt, simulate 10,000 values for \(\hat{\gamma}\) (corresponding to performing 10,000 independent experiments)
  3. Draw a histogram of the simulated values, and compare it to the null distribution
  4. Compute the corresponding p-values using the function you just created
  5. Plot the proportion of significant tests as a function of the Type I error, ranging from 0.001 to 0.5

R Code

Code
par(mfrow=c(1,2))
my_df = 20
gamma = 2
# two-sided p-value of a t-test for the null gamma = 0
pvalue = function(t, ddl) {
  return(2*pt(abs(t), df=ddl, lower.tail=FALSE))
}
# 10,000 simulated estimates under the alternative (ncp = gamma)
gamma_hat = rt(10000, df=my_df, ncp=gamma)
hist(gamma_hat, freq=F, xlim=c(-10,10), main='')
xx.g = seq(-10,10,0.01)
lines(xx.g, dt(xx.g, df=my_df), col=2)   # null distribution
pv = pvalue(gamma_hat, ddl=my_df)
# proportion of significant tests (power) for each type I error level
type.1.err = seq(0.001, 0.5, 0.001)
power = sapply(type.1.err, function(th) mean(pv < th))
plot(type.1.err, power, type='l', xlab='Type I Error', ylab='Power')

Type I Error and Multiple tests

Say we perform \(m\) hypothesis tests: what is the probability of at least one false positive?

  • P(making an error for one test) = \(\alpha\)
  • P(not making an error for one test) = \(1-\alpha\)
  • P(not making an error in \(m\) independent tests) = \((1-\alpha)^m\)
  • P(making at least one error in \(m\) tests) = \(1 - (1-\alpha)^m\)

Write an R function that computes the probability of making at least one error, as a function of the number of tests \(m\), and the risk \(\alpha\). Plot the results for \(m\) between 1 and 100 and \(\alpha=0.05\)

Multiple Testing problem : R Code

Code
fpr = function(ntests, alpha) {
  # probability of at least one false positive in ntests independent tests
  1 - (1-alpha)^ntests
}

m = seq(1,100)
plot(m,fpr(m,alpha=0.05),type='l',xlab="Number of tests", ylab="False Positive Rate")

When performing multiple tests, if we do not adjust our Type I Error level, the overall false positive rate increases very fast.

Counting Errors

             Not Significant   Significant   Total
\(H_0\)      U                 V             \(m_0\)
\(H_1\)      T                 S             \(m_1\)
Total        \(m-R\)           \(R\)         \(m\)

When we perform many (\(m\)) tests, we may want to control the number of errors among the tests we declare significant

  1. If an error is costly (e.g. experimental validation or big claim), we want \(P(V\geq 1)\) to be small = Family Wise Error Rate
  2. If a few errors are acceptable, we can increase the number of significant decisions by controlling \(\frac{V}{R}\) = False Discovery Rate

Control of the FWER: Bonferroni correction

  • To control the FWER at a level \(\alpha\) (FWER \(\leq \alpha\)), we need to adjust our significance threshold to the number of tests performed.
  • The Bonferroni correction consists in adjusting the significance threshold to \(\frac{\alpha}{m}\)
  • For example, if we test 100 hypotheses and want to control the FWER at 5%, we will only call significant p-values \(< 0.05/100 = 5 \times 10^{-4}\)
  • Problem: statistical power is greatly affected
  • Possible solution: control the False Discovery Rate
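In R, the Bonferroni correction can be applied either by comparing the raw p-values to \(\alpha/m\), or with p.adjust (a minimal sketch on made-up p-values):

```r
pv = c(0.0004, 0.012, 0.03)    # hypothetical raw p-values, m = 3 tests
m = length(pv)
alpha = 0.05
# direct form: compare each raw p-value to alpha / m
sig_direct = pv < alpha / m
# equivalent form: inflate the p-values by m and compare to alpha
sig_adjust = p.adjust(pv, method = "bonferroni") < alpha
# both forms select the same set of significant tests
```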

Counting Errors

             Not Significant   Significant   Total
\(H_0\)      U                 V             \(m_0\)
\(H_1\)      T                 S             \(m_1\)
Total        \(m-R\)           \(R\)         \(m\)

\[FDR = \frac{V}{R}\]

Question: how do we estimate the FDR?

Benjamini and Hochberg FDR

Consider first the single-test setting.

Recall that under \(H_0\), p-values (\(p\)) are uniformly distributed on \([0,1]\)

\(FDR = P(H_0 | p \leq \theta) = \frac{P(H_0 \& p \leq \theta)}{P(p \leq \theta)}\)

\(FDR = \frac{P(p\leq \theta | H_0) P(H_0)}{P(p\leq \theta| H_0)P(H_0) + P(p \leq \theta|H_1)P(H_1)}\)

\(FDR = \frac{\pi_0\theta}{\pi_0\theta + (1-\pi_0)F_1(\theta)}\)

\(FDR = \frac{\pi_0\theta}{F(\theta)}\)

where \(F\) denotes the cumulative distribution function (CDF) of the p-values, \(F_X(x) = P(X \leq x)\), \(F_1\) the CDF of p-values under \(H_1\), and \(\pi_0 = P(H_0)\) the proportion of true null hypotheses.

To control the FDR at level \(\alpha\) we want to find the threshold \(\theta\) such that \[\frac{\pi_0\theta}{F(\theta)}=\alpha\]

Benjamini and Hochberg FDR

\[\frac{\pi_0\theta}{F(\theta)} = \alpha \iff F(\theta) = \frac{\pi_0}{\alpha}\theta\]

  • On the left-hand side: a CDF; on the right-hand side: a line of slope \(\frac{\pi_0}{\alpha}\).
  • We have two unknown quantities to deal with: \(F(\theta)\) and \(\pi_0\).
  • Multiple tests \(\rightarrow\) empirical estimation of \(F(\theta)\)
  • Setting \(\pi_0 = 1\) results in a conservative estimate of the FDR (the true FDR is \(\leq \alpha\))

BH Procedure

Given a list of p-values (multiple tests)

  1. Sort the p-values in ascending order: \(P_1 \leq P_2 \leq ... \leq P_m\)
  2. Find the largest \(j\) such that \(P_j \leq \frac{\alpha}{m}j\)
  3. Call significant \(P_1, P_2, ..., P_j\)
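The procedure above can be sketched directly in R (the function name bh_significant is ours; p.adjust with method "BH" is the reference implementation and gives the same decisions):

```r
bh_significant = function(pv, alpha) {
  m = length(pv)
  o = order(pv)                        # positions of the sorted p-values
  below = sort(pv) <= alpha * (1:m) / m
  if (!any(below)) return(rep(FALSE, m))
  j = max(which(below))                # largest j with P_j <= alpha * j / m
  sig = rep(FALSE, m)
  sig[o[1:j]] = TRUE                   # call P_1 ... P_j significant
  sig
}
# p-values from the exercise below
pv = c(0.015, 0.025, 0.033, 0.003, 0.055, 0.047, 0.009, 0.042, 0.013, 0.022)
bh_significant(pv, 0.05)
```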

BH FDR : R exercise

  1. Create a vector of \(m=10\) p-values: (0.015, 0.025, 0.033, 0.003, 0.055, 0.047, 0.009, 0.042, 0.013, 0.022)
  2. Using the Bonferroni correction at \(FWER = 0.05\), which tests will be called significant?
  3. Using the ecdf R function, plot the empirical distribution function of the p-values, and add the line of slope \(1/\alpha\) for a target FDR \(\alpha = 0.05\)
  4. Plot the sorted p-values in order and draw the line of slope \(\alpha/m\)
  5. Which tests are called significant with \(FDR \leq 0.05\)?
  6. Verify using the p.adjust function with the method “BH”

R Code

Code
par(mfrow=c(1,3))
pv = c(0.015, 0.025, 0.033, 0.003, 0.055, 0.047, 0.009, 0.042, 0.013, 0.022)
m = length(pv)
# Bonferroni: tests significant at FWER = 0.05
pv.bonf = pv[pv < 0.05/m]

# empirical CDF of the p-values with the line of slope 1/alpha
plot(ecdf(pv))
abline(coef=c(0, 1/0.05))

# sorted p-values against their rank, with the BH line of slope alpha/m
plot(sort(pv), pch=16, xlab="rank", ylab="p-value")
abline(coef=c(0, 0.05/m))

# BH-adjusted p-values (q-values)
qv = p.adjust(pv, method='BH')
plot(pv, qv, xlab='p-value', ylab='FDR', pch=19)
abline(h=0.05)

Conclusions

  • Today, experimental datasets are larger and larger (genomics, sensors, …)
  • Very large numbers of hypotheses are tested
  • Adjusting for multiple testing is mandatory to ensure the validity of conclusions
  • The procedures are usually simple, and implemented in R functions or packages